Guiding solution. An analysis of the nycflights13 datasets. Mandatory project report in Tools for Analytics (R part).
We consider the datasets available from the nycflights13 package, which contains information about every flight that departed from New York City in 2013. Let us have a look at the datasets. First, we load the packages needed for this report:
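The original code chunk is not shown in this rendering; the packages are presumably loaded along these lines (the exact set is an assumption, inferred from the functions called later in the report):

```r
library(tidyverse)     # dplyr, ggplot2, stringr, ...
library(nycflights13)  # the flight, airline, airport, plane and weather data
library(skimr)         # skim() data overviews
library(rmarkdown)     # paged_table()
library(lubridate)     # hm() time parsing
library(directlabels)  # geom_dl() line labels
library(patchwork)     # combining ggplots with +
```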
The datasets in the nycflights13 package are:
| Dataset | Description |
|---|---|
| airlines | Airline names. |
| airports | Airport metadata. |
| flights | Flights data. |
| planes | Plane metadata. |
| weather | Hourly weather data. |
Let us try to do some descriptive analytics on the different datasets.
In this section we will focus on the flights dataset, which lists all domestic flights out of the New York area in 2013. We run skim to get an overview:
skim(flights)
| Name | flights |
| Number of rows | 336776 |
| Number of columns | 19 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 14 |
| POSIXct | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| carrier | 0 | 1.00 | 2 | 2 | 0 | 16 | 0 |
| tailnum | 2512 | 0.99 | 5 | 6 | 0 | 4043 | 0 |
| origin | 0 | 1.00 | 3 | 3 | 0 | 3 | 0 |
| dest | 0 | 1.00 | 3 | 3 | 0 | 105 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1.00 | 2013.00 | 0.00 | 2013 | 2013 | 2013 | 2013 | 2013 | ▁▁▇▁▁ |
| month | 0 | 1.00 | 6.55 | 3.41 | 1 | 4 | 7 | 10 | 12 | ▇▆▆▆▇ |
| day | 0 | 1.00 | 15.71 | 8.77 | 1 | 8 | 16 | 23 | 31 | ▇▇▇▇▆ |
| dep_time | 8255 | 0.98 | 1349.11 | 488.28 | 1 | 907 | 1401 | 1744 | 2400 | ▁▇▆▇▃ |
| sched_dep_time | 0 | 1.00 | 1344.25 | 467.34 | 106 | 906 | 1359 | 1729 | 2359 | ▁▇▇▇▃ |
| dep_delay | 8255 | 0.98 | 12.64 | 40.21 | -43 | -5 | -2 | 11 | 1301 | ▇▁▁▁▁ |
| arr_time | 8713 | 0.97 | 1502.05 | 533.26 | 1 | 1104 | 1535 | 1940 | 2400 | ▁▃▇▇▇ |
| sched_arr_time | 0 | 1.00 | 1536.38 | 497.46 | 1 | 1124 | 1556 | 1945 | 2359 | ▁▃▇▇▇ |
| arr_delay | 9430 | 0.97 | 6.90 | 44.63 | -86 | -17 | -5 | 14 | 1272 | ▇▁▁▁▁ |
| flight | 0 | 1.00 | 1971.92 | 1632.47 | 1 | 553 | 1496 | 3465 | 8500 | ▇▃▃▁▁ |
| air_time | 9430 | 0.97 | 150.69 | 93.69 | 20 | 82 | 129 | 192 | 695 | ▇▂▂▁▁ |
| distance | 0 | 1.00 | 1039.91 | 733.23 | 17 | 502 | 872 | 1389 | 4983 | ▇▃▂▁▁ |
| hour | 0 | 1.00 | 13.18 | 4.66 | 1 | 9 | 13 | 17 | 23 | ▁▇▇▇▅ |
| minute | 0 | 1.00 | 26.23 | 19.30 | 0 | 8 | 29 | 44 | 59 | ▇▃▆▃▅ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| time_hour | 0 | 1 | 2013-01-01 05:00:00 | 2013-12-31 23:00:00 | 2013-07-03 10:00:00 | 6936 |
The variables in this dataset are:
- year, month, day: date of departure.
- dep_time, arr_time: actual departure and arrival times.
- sched_dep_time, sched_arr_time: scheduled departure and arrival times.
- dep_delay, arr_delay: delays in minutes.
- hour, minute: time of scheduled departure.
- carrier: carrier abbreviation.
- tailnum: tail number of the plane.
- flight: flight number.
- origin, dest: origin and destination.
- air_time: time spent in the air.
- distance: distance flown.
- time_hour: scheduled date and hour of the flight.

For further details about the dataset see ?flights or the online documentation.
The skim output indicates that some flights were canceled. We remove these observations from the dataset:
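The removal chunk itself is not shown; a minimal sketch, assuming a canceled flight is one with no recorded actual departure time:

```r
# keep only flights that actually departed (dep_time is NA for canceled flights)
dat <- flights %>%
  filter(!is.na(dep_time))
```

This leaves 328,521 of the 336,776 rows, matching the tibbles printed below.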
Let us first try some mutating joins, which combine variables from multiple tables. In flights we have flight information with an abbreviation for the carrier (carrier), and in airlines we have a mapping between abbreviations and full names (name). We can use a join to add the carrier names to the flight data:
dat <- dat %>%
left_join(airlines) %>%
rename(carrier_name = name) %>%
print()
# A tibble: 328,521 × 20
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
7 2013 1 1 555 600 -5 913
8 2013 1 1 557 600 -3 709
9 2013 1 1 557 600 -3 838
10 2013 1 1 558 600 -2 753
# … with 328,511 more rows, and 13 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>, carrier_name <chr>
Note that we join by the column carrier, which is present in both data frames; that is, the default argument by = c("carrier" = "carrier") is used. If we want the full name of the origin airport, we need to specify which column to join on, since each flight has both an origin and a destination airport. Afterwards we do the same for the destination airport.
dat <- dat %>%
left_join(airports %>% select(faa, name),
by = c("origin" = "faa")) %>%
rename(origin_name = name) %>%
left_join(airports %>% select(faa, name),
by = c("dest" = "faa")) %>%
rename(dest_name = name) %>%
select(month, carrier_name, origin_name, dest_name, sched_dep_time, dep_delay, arr_delay, distance, tailnum) %>%
print()
# A tibble: 328,521 × 9
month carrier_name origin_name dest_name sched_dep_time dep_delay
<int> <chr> <chr> <chr> <int> <dbl>
1 1 United Air L… Newark Lib… George Bu… 515 2
2 1 United Air L… La Guardia George Bu… 529 4
3 1 American Air… John F Ken… Miami Intl 540 2
4 1 JetBlue Airw… John F Ken… <NA> 545 -1
5 1 Delta Air Li… La Guardia Hartsfiel… 600 -6
6 1 United Air L… Newark Lib… Chicago O… 558 -4
7 1 JetBlue Airw… Newark Lib… Fort Laud… 600 -5
8 1 ExpressJet A… La Guardia Washingto… 600 -3
9 1 JetBlue Airw… John F Ken… Orlando I… 600 -3
10 1 American Air… La Guardia Chicago O… 600 -2
# … with 328,511 more rows, and 3 more variables: arr_delay <dbl>,
# distance <dbl>, tailnum <chr>
We now have the flights data we need stored in the data frame dat. Let us try to answer some questions.
We first calculate a summary table:
dat %>%
count(origin_name, carrier_name, sort = TRUE) %>%
paged_table()
Let us visualize the numbers. First we facet by airport and use geom_bar:
dat %>%
ggplot(aes(carrier_name)) +
geom_bar() +
facet_grid(rows = vars(origin_name)) +
labs(
title = "Number of flights",
x = "Carrier",
y = "Flights"
) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
We can also compare the two categorical variables by using geom_count:
dat %>%
ggplot(aes(origin_name, carrier_name)) +
geom_count() +
labs(
title = "Number of flights",
y = "Carrier",
x = "Departure airport",
size = "Flights"
)
Finally, we can make a heatmap using geom_tile. Since geom_tile doesn't offer a way to calculate counts on its own, we use the function count in our pipe:
dat %>%
count(origin_name, carrier_name) %>%
ggplot(aes(origin_name, carrier_name, fill = n)) +
geom_tile() +
labs(
title = "Number of flights",
y = "Carrier",
x = "Departure airport",
fill = "Flights"
)
Summary counts by month and carrier are:
dat %>%
count(month, carrier_name, sort = TRUE) %>%
paged_table()
We will try to visualize the numbers using a line plot with carrier as color aesthetic:
dat %>%
count(month, carrier_name) %>%
ggplot(mapping = aes(x = month, y = n, color = carrier_name)) +
geom_line() +
geom_point() +
geom_dl(aes(label = carrier_name), method = list(dl.trans(x = x + .3), "last.bumpup")) +
scale_x_continuous(breaks = 1:12, limits = c(1,17)) +
labs(
title = "Number of flights",
y = "Flights",
x = "Month"
) +
theme(legend.position = "none")
Note that the delay columns are in minutes. We first convert them to hours:
dat <- dat %>%
mutate(across(contains("delay"), ~ .x / 60))
Next, we answer the question by looking at different measures.
Let us first have a look at the average departure delay by airline. The dplyr package has two functions that make this easy: group_by and summarise. We use them together to group the rows of the dataset by carrier and then calculate the average delay with summarise and the mean function:
dat %>%
group_by(carrier_name) %>%
summarise(ave_delay = mean(dep_delay, na.rm = TRUE)) %>%
arrange(desc(ave_delay)) %>%
paged_table()
Note that the mean function has a na.rm argument which ignores missing values; otherwise the average delays could not be calculated. We can visualize our summary (a continuous-categorical comparison) by piping the table into a column plot:
dat %>%
group_by(carrier_name) %>%
summarise(ave_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot(aes(carrier_name, ave_delay)) +
geom_col()
To get a better visualization we reorder the categorical x-axis by average delay, use the full names of the airlines (which are rotated) and add some informative labels:
dat %>%
group_by(carrier_name) %>%
summarise(ave_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot(aes(reorder(carrier_name, ave_delay), 60 * ave_delay)) +
geom_col() +
labs(
title = "Average departure delay for each carrier",
x = "Carrier",
y = "Delay (minutes)"
) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
To conclude, Frontier (F9) and ExpressJet (EV) have the highest average delays. However, using the mean to summarize a variable can be dangerous, because it is sensitive to outliers!
We should always ask about the variation in the variables in our data sets, but it’s especially important to do so if we’re going to use averages to summarize them.
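A tiny made-up illustration of why: a single extreme delay drags the mean far away from the bulk of the observations, while the median barely notices (the numbers below are hypothetical, not from the flights data):

```r
delays <- c(-5, -2, 0, 3, 1300)  # hypothetical delays in minutes, one extreme outlier

mean(delays)    # 259.2 -- pulled far up by the single outlier
median(delays)  # 0     -- unaffected by it
```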
First let us calculate the standard deviation for each carrier:
dat %>%
group_by(carrier_name) %>%
summarise(ave_delay = mean(dep_delay, na.rm = TRUE), std = sd(dep_delay, na.rm = TRUE)) %>%
arrange(desc(std)) %>%
paged_table()
What is the distribution of departure delays by airline? Visualized as a density distribution using carrier as fill aesthetic:
dat %>%
ggplot(aes(dep_delay, fill = carrier_name)) +
geom_density(alpha = 0.5) +
labs(
title = "Departure delay densities for each carrier",
x = "Delay (hours)",
y = "Density",
fill = "Carrier"
) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
We can see that there is a small number of huge outliers, which can make the mean very misleading.
Let us try to plot the empirical cumulative distribution for each carrier, using carrier as color aesthetic and zooming in on delays of at most 3 hours:
dat %>%
ggplot() +
stat_ecdf(aes(x = dep_delay, color = carrier_name), alpha = 0.75) +
coord_cartesian(xlim = c(-0.1,3)) +
labs(
title = "Departure delay empirical cumulative distributions",
x = "Delay (hours)",
y = "Probability",
color = "Carrier"
) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
Note that the further toward the upper left a curve lies, the better: a carrier dominates another if its line is above the other's. Comparing this to the standard deviations, we see that the standard deviation is not a good measure for delays.
Variation in data like these, where the outliers are very sparse, is hard to visualize using density plots. We may also use a boxplot:
dat %>%
ggplot(aes(carrier_name, dep_delay)) +
geom_boxplot() +
labs(
title = "Variation in departure delay for each carrier",
x = "Carrier",
y = "Delay (hours)"
) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
We can see that most flights have a median around zero. However, some carriers have larger delays compared to others. Is the variation in departure delay different given departure airport? We use departure airport as color aesthetic:
dat %>%
ggplot(aes(carrier_name, dep_delay, color = origin_name)) +
geom_boxplot() +
labs(
title = "Variation in departure delay",
x = "Carrier",
y = "Delay (hours)",
color = "Departure airport"
) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
legend.position = "bottom")
This does not seem to be the case for most carriers.
The boxplot shows median values in the center. What would happen if we used the median instead of the average delay and made a column plot?
dat %>%
group_by(carrier_name) %>%
summarise(median_delay = median(dep_delay, na.rm = TRUE)) %>%
ggplot(aes(reorder(carrier_name, median_delay), median_delay)) +
geom_col() +
labs(
title = "Median departure delay for each carrier",
x = "Carrier",
y = "Median delay (hours)"
) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
That tells a bit of a different story! Fly SkyWest (OO) and you’ll get to leave six minutes early. Seemingly small, simple differences in the tools you choose when exploring data can lead to visualizations that tell very different stories.
How many flights were really delayed and how does that break down by airline carrier? Being delayed more than an hour really sucks, so let’s use that as our cutoff:
dat %>%
filter(dep_delay > 1)
# A tibble: 26,581 × 9
month carrier_name origin_name dest_name sched_dep_time dep_delay
<int> <chr> <chr> <chr> <int> <dbl>
1 1 Envoy Air La Guardia Charlotte… 630 1.68
2 1 American Air… John F Ken… Miami Intl 715 1.18
3 1 Envoy Air John F Ken… Baltimore… 1835 14.2
4 1 United Air L… Newark Lib… General E… 733 2.4
5 1 United Air L… La Guardia George Bu… 900 2.23
6 1 ExpressJet A… Newark Lib… Savannah … 944 1.6
7 1 Envoy Air La Guardia Minneapol… 1150 1.18
8 1 JetBlue Airw… John F Ken… Los Angel… 1220 1.28
9 1 ExpressJet A… La Guardia Memphis I… 1250 1.17
10 1 ExpressJet A… Newark Lib… Richmond … 1310 1.92
# … with 26,571 more rows, and 3 more variables: arr_delay <dbl>,
# distance <dbl>, tailnum <chr>
That's a lot of flights! We can use the dplyr function count to give us a summary of the number of rows of dat that correspond to each carrier:
dat %>%
filter(dep_delay > 1) %>%
count(carrier_name, sort = TRUE)
# A tibble: 16 × 2
carrier_name n
<chr> <int>
1 ExpressJet Airlines Inc. 6861
2 JetBlue Airways 4571
3 United Air Lines Inc. 3824
4 Delta Air Lines Inc. 2651
5 American Airlines Inc. 2003
6 Envoy Air 1996
7 Endeavor Air Inc. 1966
8 Southwest Airlines Co. 1061
9 US Airways Inc. 766
10 Virgin America 363
11 AirTran Airways Corporation 314
12 Mesa Airlines Inc. 79
13 Frontier Airlines Inc. 73
14 Alaska Airlines Inc. 39
15 Hawaiian Airlines Inc. 10
16 SkyWest Airlines Inc. 4
Note that count has created a column named n which contains the counts, and sort = TRUE sorts that column for us.
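For reference, count is just convenient shorthand for a grouped summarise; the following (hypothetical) pipelines give the same table:

```r
# shorthand
dat %>% count(carrier_name, sort = TRUE)

# equivalent long form
dat %>%
  group_by(carrier_name) %>%
  summarise(n = n()) %>%
  arrange(desc(n))
```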
We can visualize it with a column plot (note that count has already sorted the rows; we only convert carrier_name to a factor to preserve that order):
dat %>%
filter(dep_delay > 1) %>%
count(carrier_name, sort = TRUE) %>%
mutate(carrier_name = factor(carrier_name, levels = carrier_name, ordered = TRUE)) %>%
ggplot(aes(carrier_name, n)) +
geom_col() +
labs(
title = "Number of flights delayed more than one hour",
x = "Carrier",
y = "Flights"
) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
It seems that ExpressJet (EV) has a problem: they have a lot of very delayed flights.
We plot the delays against each other as points.
ggplot(data = dat, mapping = aes(x = dep_delay, y = arr_delay)) +
geom_point(alpha = 0.1) +
labs(
title = "Departure against arrival delay",
x = "Departure delay (hours)",
y = "Arrival delay (hours)"
)
The large mass of points near (0, 0) makes it hard to tell how many points are actually plotted there, a phenomenon called overplotting: points are plotted on top of each other over and over again. To mitigate this, we adjust the transparency of the points by setting alpha = 0.1.
As we can see there is a linear relationship between the two: departure delays result in arrival delays, as expected. In general we cannot fly faster to catch up on a delay.
If you’re flying out of New York you might want to know which airport has the worst delays on average. We first calculate median and average delays:
dat %>%
group_by(origin_name) %>%
summarize(ave_delay = mean(dep_delay, na.rm = TRUE), median_delay = median(dep_delay, na.rm = TRUE))
# A tibble: 3 × 3
origin_name ave_delay median_delay
<chr> <dbl> <dbl>
1 John F Kennedy Intl 0.202 -0.0167
2 La Guardia 0.172 -0.05
3 Newark Liberty Intl 0.252 -0.0167
As we can see, La Guardia seems to have the smallest delays. However, the differences are small. Let us try to plot the empirical cumulative distribution for each airport, using airport as color aesthetic and zooming in on delays of at most 2 hours:
dat %>%
ggplot() +
stat_ecdf(aes(x = dep_delay, color = origin_name), alpha = 0.75) +
coord_cartesian(xlim = c(-0.1,2)) +
labs(
title = "Departure delay empirical cumulative distributions",
x = "Delay (hours)",
y = "Probability",
color = "Departure airport"
) +
theme(legend.position = "bottom")
The median values can be found at y = 0.5. Note that La Guardia is above the other lines, indicating that it has the smallest delays no matter which fractile we consider. Another way to visualize this covariation between a categorical (airport) and a continuous (delay) variable is with a boxplot. We zoom in so that the delay shown is at most half an hour.
dat %>%
ggplot(aes(origin_name, dep_delay)) +
geom_boxplot() +
coord_cartesian(ylim = c(-0.1, 0.5)) +
labs(
title = "Departure delay",
x = "Airport",
y = "Delay (hours)"
)
We first calculate median and average delays:
dat %>%
group_by(carrier_name, origin_name) %>%
summarize(ave_delay = mean(dep_delay, na.rm = TRUE), median_delay = median(dep_delay, na.rm = TRUE)) %>%
paged_table()
There are some differences. Let us try to do a heat map of the average delays:
dat %>%
group_by(origin_name, carrier_name) %>%
summarize(ave_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot(aes(origin_name, carrier_name, fill = 60*ave_delay)) +
geom_tile() +
scale_fill_continuous(low = "#31a354", high = "#e5f5e0") +
labs(
title = "Average departure delays",
x = "Departure airport",
y = "Carrier",
fill = "Ave. delay (min)"
)
For each carrier this gives a good insight into the differences between the airports. Another way to visualize the covariation is with a box plot. We use a little scaling to get a better picture of the delay and zoom in so that the delay shown is at most half an hour.
dat %>%
ggplot(aes(carrier_name, 60*dep_delay, fill = origin_name)) +
geom_boxplot() +
coord_cartesian(ylim = c(-10, 30)) +
labs(
title = "Departure delay",
x = "Carrier",
y = "Delay (min)",
fill = "Departure airport"
) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
dat %>%
ggplot() +
stat_ecdf(aes(x = dep_delay, color = origin_name), alpha = 0.75) +
coord_cartesian(xlim = c(-0.1, 1)) +
facet_wrap(vars(carrier_name)) +
labs(
title = "Departure delay empirical cumulative distributions",
x = "Delay (hours)",
y = "Probability",
color = "Departure airport"
) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
legend.position = "bottom")
First, note that sched_dep_time is a number in the format HHMM. We convert it into an hours:minutes data type and afterwards to hours since midnight:
dat <- dat %>%
mutate(sched_dep_time = hm(str_replace(sched_dep_time, "^(.*)(..)$", "\\1:\\2"))) %>%
mutate(sched_dep_time = as.numeric(sched_dep_time)/60/60)
To explore covariation in two continuous (quantitative) variables, we can use a scatter plot:
dat %>%
ggplot(aes(sched_dep_time, dep_delay, color = origin_name)) +
geom_point(alpha = 0.1) +
# geom_smooth() +
labs(
title = "Departure delay given departure time",
y = "Delay (hours)",
x = "Departure time (hours after midnight) ",
color = "Departure airport"
) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
legend.position = "bottom")
Based on the plot there does not seem to be a clear effect for the different airports.
We use the patchwork package to plot distance against the two delays:
p1 <- dat %>%
ggplot(aes(x=distance, y= dep_delay)) +
geom_point(alpha = 0.1) +
geom_smooth() +
labs(
y = "Dept. delay (hours)",
x = "Distance"
)
p2 <- dat %>%
ggplot(aes(x=distance, y= arr_delay)) +
geom_point(alpha = 0.1) +
geom_smooth() +
labs(
y = "Arrival delay (hours)",
x = "Distance"
)
p1 + p2
Based on the plots there does not seem to be a clear effect of distance on the delays.
Let us do a mutating join so we have a bit more information about each airplane:
dat <- dat %>%
left_join(planes %>%
select(tailnum, plane_manufacturer = manufacturer, plane_model = model))
This could be useful for some kind of maintenance activity that needs to be done after x number of trips. The summary table is (based on tailnum):
dat %>%
count(tailnum, month) %>%
paged_table()
As an example, consider the plane N355NB:
dat1 <- dat %>%
filter(tailnum=="N355NB")
The specifications are:
filter(planes, tailnum=="N355NB")
# A tibble: 1 × 9
tailnum year type manufacturer model engines seats speed engine
<chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
1 N355NB 2002 Fixed w… AIRBUS A319… 2 145 NA Turbo…
We see that it is an Airbus A319 with 145 seats. The plane flew 124 flights in 2013 with a total distance of 103,089 miles.
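The two numbers quoted above can be computed from dat1 along these lines (a sketch of the presumably elided chunk):

```r
dat1 %>%
  summarise(
    n_flights      = n(),           # flights flown by N355NB in 2013
    total_distance = sum(distance)  # total distance in miles
  )
```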
Let us have a look at the destinations:
dat1 %>%
count(dest_name) %>%
ggplot(aes(x = reorder(dest_name, -n), y = n)) +
geom_col()
In this section we will focus on the weather dataset, which lists hourly meteorological data for LGA, JFK and EWR. We run skim to get an overview:
skim(weather)
| Name | weather |
| Number of rows | 26115 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 13 |
| POSIXct | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| origin | 0 | 1 | 3 | 3 | 0 | 3 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1.00 | 2013.00 | 0.00 | 2013.00 | 2013.00 | 2013.00 | 2013.00 | 2013.00 | ▁▁▇▁▁ |
| month | 0 | 1.00 | 6.50 | 3.44 | 1.00 | 4.00 | 7.00 | 9.00 | 12.00 | ▇▆▆▆▇ |
| day | 0 | 1.00 | 15.68 | 8.76 | 1.00 | 8.00 | 16.00 | 23.00 | 31.00 | ▇▇▇▇▆ |
| hour | 0 | 1.00 | 11.49 | 6.91 | 0.00 | 6.00 | 11.00 | 17.00 | 23.00 | ▇▇▆▇▇ |
| temp | 1 | 1.00 | 55.26 | 17.79 | 10.94 | 39.92 | 55.40 | 69.98 | 100.04 | ▂▇▇▇▁ |
| dewp | 1 | 1.00 | 41.44 | 19.39 | -9.94 | 26.06 | 42.08 | 57.92 | 78.08 | ▁▆▇▇▆ |
| humid | 1 | 1.00 | 62.53 | 19.40 | 12.74 | 47.05 | 61.79 | 78.79 | 100.00 | ▁▆▇▇▆ |
| wind_dir | 460 | 0.98 | 199.76 | 107.31 | 0.00 | 120.00 | 220.00 | 290.00 | 360.00 | ▆▂▆▇▇ |
| wind_speed | 4 | 1.00 | 10.52 | 8.54 | 0.00 | 6.90 | 10.36 | 13.81 | 1048.36 | ▇▁▁▁▁ |
| wind_gust | 20778 | 0.20 | 25.49 | 5.95 | 16.11 | 20.71 | 24.17 | 28.77 | 66.75 | ▇▅▁▁▁ |
| precip | 0 | 1.00 | 0.00 | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 1.21 | ▇▁▁▁▁ |
| pressure | 2729 | 0.90 | 1017.90 | 7.42 | 983.80 | 1012.90 | 1017.60 | 1023.00 | 1042.10 | ▁▁▇▆▁ |
| visib | 0 | 1.00 | 9.26 | 2.06 | 0.00 | 10.00 | 10.00 | 10.00 | 10.00 | ▁▁▁▁▇ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| time_hour | 0 | 1 | 2013-01-01 01:00:00 | 2013-12-30 18:00:00 | 2013-07-01 14:00:00 | 8714 |
For further details, run View(weather) or bring up the help file with ?weather.
Observe that there is a variable called temp with hourly temperature recordings in Fahrenheit at weather stations near all three major airports in New York City: Newark (origin code EWR), John F. Kennedy International (JFK), and LaGuardia (LGA). Let us transform the temperature to Celsius:
dat_w <- weather %>%
left_join(airports %>% select(faa, name),
by = c("origin" = "faa")) %>%
rename(origin_name = name) %>%
mutate(temp = (temp - 32) * (5/9) ) %>%
select(origin_name, time_hour, month, temp)
We start by plotting temperature over the year with airport/origin as color aesthetic. We also add a smoothing line:
dat_w %>%
ggplot(mapping = aes(x = time_hour, y = temp, color = origin_name)) +
geom_line(alpha = 0.2) +
geom_smooth(alpha = 0.25)
Note that we have used the alpha aesthetic to make the lines more transparent. There are many fluctuations; however, the temperature cycle from winter to summer is clear. Moreover, JFK seems to have an outlier in May (approx. -10 degrees), probably due to a faulty measurement.
Let us start by plotting the density for each airport:
dat_w %>%
ggplot(mapping = aes(x = temp, fill = origin_name)) +
geom_density(alpha=0.75) +
geom_vline(
data = dat_w %>%
group_by(origin_name) %>%
summarise(m = mean(temp, na.rm = TRUE)),
mapping = aes(xintercept = m, color = origin_name)
)
Note that the mean temperature is more or less the same at each airport (vertical lines). There are somewhat larger fluctuations at Newark compared to, for instance, JFK (which has the lowest spread).
A closer look can be done by faceting by month:
dat_w %>%
ggplot(mapping = aes(x = temp, fill = origin_name)) +
geom_density(alpha=0.5) +
facet_wrap(vars(month))
Finally, let us consider a boxplot of temperature for each month:
ggplot(data = weather, mapping = aes(x = factor(month), y = temp)) +
geom_boxplot()
The resulting plot shows 12 separate boxplots side by side and illustrates the variability and fluctuations over the year.
What does the dot at the bottom of the plot for May correspond to? Explain what might have occurred in May to produce this point.
The canceled flights are:
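The chunk defining the canceled-flights table dat_c is not shown; a minimal sketch, again assuming a canceled flight is one with a missing actual departure time:

```r
# canceled flights: scheduled but never departed
dat_c <- flights %>%
  filter(is.na(dep_time))
```

This gives the 8,255 rows seen in the output below.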
Give comments based on your intuition. Is the analysis valid?
A few examples of analysis:
We first get the full names of the carriers by joining the canceled-flights table with the airlines table.
dat_c <- dat_c %>%
  left_join(airlines) %>%
  rename(carrier_name = name) %>%
  mutate(sched_dep_time = hm(str_replace(sched_dep_time, "^(.*)(..)$", "\\1:\\2"))) %>%
  mutate(sched_dep_time = as.numeric(sched_dep_time)/60/60) %>%
  print()
# A tibble: 8,255 × 20
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <dbl> <dbl> <int>
1 2013 1 1 NA 16.5 NA NA
2 2013 1 1 NA 19.6 NA NA
3 2013 1 1 NA 15 NA NA
4 2013 1 1 NA 6 NA NA
5 2013 1 2 NA 15.7 NA NA
6 2013 1 2 NA 16.3 NA NA
7 2013 1 2 NA 13.9 NA NA
8 2013 1 2 NA 14.3 NA NA
9 2013 1 2 NA 13.4 NA NA
10 2013 1 2 NA 15.8 NA NA
# … with 8,245 more rows, and 13 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>, carrier_name <chr>
Number of canceled flights in each month:
dat_c %>%
ggplot(aes(factor(month))) +
geom_bar() +
labs(
title = "Number of canceled flights",
x = "Month",
y = "Flights"
)
Number of canceled flights over the day:
dat_c %>%
ggplot(aes(sched_dep_time)) +
geom_histogram(binwidth = 2) +
labs(
title = "Number of canceled flights",
x = "Hour",
y = "Flights"
)
Number of canceled flights per carrier:
dat_c %>%
count(carrier_name) %>%
ggplot(aes(x = reorder(carrier_name, n), y = n)) +
geom_col() +
labs(
title = "Number of canceled flights",
x = "Carrier",
y = "Flights"
) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
Number of canceled flights per carrier from each origin airport:
dat_c %>%
ggplot(aes(carrier_name)) +
geom_bar() +
facet_grid(rows = vars(origin)) +
labs(
title = "Number of canceled flights from each departure airport",
x = "Carrier",
y = "Flights"
) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
No solutions are provided here. Give comments based on your intuition. Is the analysis valid?
This report has been created inside RStudio using R Markdown and the distill format.
The report was built using:
setting value
version R version 4.1.1 (2021-08-10)
os macOS Big Sur 10.16
system x86_64, darwin19.6.0
ui unknown
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz Europe/Copenhagen
date 2021-09-23
Along with these packages:
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".